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Imposition of a lasso penalty shrinks parameter estimates toward 
zero and performs continuous model selection. Lasso penalized re- 
gression is capable of handling linear regression problems where the 
number of predictors far exceeds the number of cases. This paper 
tests two exceptionally fast algorithms for estimating regression co- 
efficients with a lasso penalty. The previously known £2 algorithm is 
based on cyclic coordinate descent. Our new £\ algorithm is based on 
greedy coordinate descent and Edgeworth's algorithm for ordinary 
£\ regression. Each algorithm relies on a tuning constant that can be 
chosen by cross-validation. In some regression problems it is natural 
to group parameters and penalize parameters group by group rather 
than separately. If the group penalty is proportional to the Euclidean 
norm of the parameters of the group, then it is possible to majorize 
the norm and reduce parameter estimation to £2 regression with a 
lasso penalty. Thus, the existing algorithm can be extended to novel 
settings. Each of the algorithms discussed is tested via either simu- 
lated or real data or both. The Appendix proves that a greedy form 
of the £2 algorithm converges to the minimum value of the objective 
function. 

1. Introduction. This paper explores fast algorithms for lasso penalized 
regression [Chen et al. (1998), Claerbout and Muir (1973), Santosa and Symes 
(1986), Taylor et al. (1979) and Tibshirani (1996)]. The lasso performs con- 
tinuous model selection and enforces sparse solutions in problems where the 
number of predictors p exceeds the number of cases n. In the regression set- 
ting, let yi be the response for case i, xij be the value of predictor j for case 
i, and (3j be the regression coefficient corresponding to predictor j. The in- 
tercept [i is ignored in the lasso penalty, whose strength is determined by the 
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positive tuning constant A. Given the parameter vector 9 = . . . , j3 p ) 

and the loss function g(B), lasso penalized regression can be phrased as min- 
imizing the criterion 



where g{6) equals J27=i I Hi ~ M ~~ Ej=i x ijflj I f° r regression and equals 
I E"=i(yi - A* - Ej=i aty-/?,-) 2 for £ 2 regression. 

The lasso penalty AE?=i shrinks each /3j toward the origin and tends 
to discourage models with large numbers of marginally relevant predictors. 
The lasso penalty is more effective in deleting irrelevant predictors than a 
ridge penalty \J2*j=i ft] because |6| is much bigger than b 2 for small b. When 
protection against outliers is a major concern, l\ regression is preferable to 
£2 regression [Wang et al. (2006a)]. 

Lasso penalized estimation raises two issues. First, what is the most effec- 
tive method of minimizing the objective function (1)? Second, how does one 
choose the tuning parameter A? Although the natural answer to the second 
question is cross-validation, the issue of efficient computation arises here as 
well. We will discuss a useful approach in Section 6. The answer to the first 
question is less obvious. Standard methods of regression involve matrix di- 
agonalization, matrix inversion, or, at the very least, the solution of large 
systems of linear equations. Because the number of arithmetic operations for 
these processes scales as the cube of the number of predictors, problems with 
thousands of predictors appear intractable. Recent research has shown this 
assessment to be too pessimistic [Candes and Tao (2007), Park and Hastie 
(2006a, 2006b) and Wang et al. (2006a)]. In the current paper we highlight 
the method of coordinate descent. Our reasons for liking coordinate descent 
boil down to simplicity, speed and stability. 

Fu (1998) and Daubechies et al. (2004) explicitly suggest coordinate de- 
scent for lasso penalized £2 regression. For inexplicable reasons, they did not 
follow up their theoretical suggestions with numerical confirmation for highly 
underdetermined problems. Claerbout and Muir (1973) note that lasso pe- 
nalized £\ regression also yields to coordinate descent. Both methods are 
incredibly quick and have the potential to revolutionize data mining. The 
competing linear programming algorithm of Wang et al. (2006a) for penal- 
ized t\ regression is motivated by the problem of choosing the tuning pa- 
rameter A. Their algorithm follows the central path determined by the min- 
imum of f(9) as a function of A. This procedure reveals exactly when each 
estimated (3j enters the linear prediction model. The central path method 
is also applicable to penalized £2 regression and penalized estimation with 
generalized linear models [Park and Hastie (2006b)]. 
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Besides introducing a modification of the £\ coordinate descent algo- 
rithm, we want to comment on group selection in £2 regression. To set the 
stage for both purposes, we will review the previous work of Fu (1998) and 
Daubechies et al. (2004). We approach £\ regression through the nearly for- 
gotten algorithm of Edgeworth (1887, 1888), which for a long time was con- 
sidered a competitor of least squares. Portnoy and Koenker (1997) trace the 
history of the algorithm from Boscovich to Laplace to Edgeworth. It is fair to 
say that the algorithm has managed to cling to life despite decades of obscu- 
rity both before and after its rediscovery by Edgeworth. Armstrong and Kung 
(1978) published a computer implementation of Edgeworth's algorithm in 
Applied Statistics. Unfortunately, this version is limited to simple linear re- 
gression. We adapt the Claerbout and Muir (1973) version of Edgeworth's 
algorithm to perform greedy coordinate descent. The resulting £\ algorithm 
is faster than cyclic coordinate descent in £2 regression. 

Many data sets involve groups of correlated predictors. For example, in 
gene microarray experiments, genes can sometimes be grouped into biochem- 
ical pathways subject to genetic coregulation. Expression levels for genes 
in the same pathway are expected to be highly correlated. In such situa- 
tions it is prudent to group genes and design penalties that apply to entire 
groups. Several authors have taken up the challenge of penalized estimation 
in this context [Zou and Hastie (2005), Yuan and Lin (2006) and Zhao et al. 
(2006)]. In the current paper we will demonstrate that cyclic coordinate de- 
scent is compatible with penalties constructed from the Euclidean norms of 
parameter groups. We attack penalized estimation by combining cyclic co- 
ordinate descent with penalty majorization. This replaces the nonquadratic 
norm penalties by l\ or £2 penalties. The resulting algorithm is reminiscent 
of the generic MM algorithm for parameter estimation [Lange (2004)]. 

In the remainder of the paper Section 2 reviews cyclic coordinate descent 
for penalized £2 regression, and Section 3 develops Edgeworth's algorithm 
for penalized £\ regression. Section 4 briefly discusses convergence of coordi- 
nate descent in penalized £2 regression; the Appendix proves convergence for 
greedy coordinate descent. Section 5 amends the £2 algorithm to take into 
account grouped parameters, and Section 6 gives some guidance on how to 
select tuning constants. Sections 7 and 8 test the algorithms on simulated 
and real data, and Section 9 summarizes their strengths and suggests new 
avenues of research. 

Finally, we would like to draw the reader's attention to the recent paper 
of Friedman et al. (2007) in this journal on coordinate descent and the fused 
lasso. Their paper has substantial overlap and substantial differences with 
ours. The two papers were written independently and concurrently. 

2. Cyclic coordinate descent for £2 regression. Coordinate descent comes 
in several varieties. The standard version cycles through the parameters and 
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updates each in turn. An alternative version is greedy and updates the pa- 
rameter giving the largest decrease in the objective function. Because it is 
impossible to tell in advance which parameter is best to update, the greedy 
version uses the surrogate criterion of steepest descent. In other words, for 
each parameter we compute forward and backward directional derivatives 
and then update the parameter with the most negative directional deriva- 
tive, either forward or backward. The overhead of keeping track of these 
directional derivatives works to the detriment of the greedy method. For l\ 
regression, the overhead is relatively light, and greedy coordinate descent is 
substantially faster than cyclic coordinate descent. 

Although the lasso penalty is nondifferentiable, it does possess directional 
derivatives along each forward or backward coordinate direction. For in- 
stance, if efc is the coordinate direction along which (3k varies, then the 
objective function (1) has directional derivatives 

and 

d. e J{9) = hm = d. ek g{6) + | ^ ^ < Q. 

In £2 regression, the function g(9) is differentiable. Therefore, its direc- 
tional derivative along coincides with its ordinary partial derivative 

d n ( P \ 

i=l \ j=l I 



and its directional derivative along — coincides with the negative of its 
ordinary partial derivative. In l\ regression, the coordinate direction deriva- 
tives are 

yi - fj, - x\f3 > 0, 
de k g(0) = V { x ik , yi - ii - xl/3 < 0, 




yi - fi - xj/3 = 



yi — (i — x\P > 0, 
d -e k g{0) = Y] { -x ik , yi-fj,-xl(3< 0, 




and 



yi- H-x\f3 = 0, 

where x\ is the row vector (xji, . . . , Xi p ). 

In cyclic coordinate descent we evaluate d ek f{9) and d- ek f(9). If both 
are nonnegative, then we skip the update for [3k- This decision is defensi- 
ble when g(9) is convex because the sign of a directional derivative fully 
determines whether improvement can be made in that direction. If either 
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directional derivative is negative, then we must solve for the minimum in 
that direction. Because the objective function f(Q) is convex, it is impossible 
for both directional derivatives d ek f(6) and d_ ek f(6) to be negative. 

In underdetermined problems with just a few relevant predictors, most 
updates are skipped, and the corresponding parameters never budge from 
their starting values of 0. This simple fact plus the complete absence of 
matrix operations explains the speed of cyclic coordinate descent. It inherits 
its numerical stability from the descent property of each update. 

Fu (1998) derived cyclic coordinate descent algorithms for £2 regression 
with penalties XJ2j\Pj\ a with a > 1. With a lasso penalty (a = 1), the 
update of the intercept parameter can be written as 



For the parameter there are separate solutions to the left and right of 0. 
These amount to 



Only one of these two solutions can be nonzero. The partial derivatives 



of g{9) are easy to compute if we keep track of the residuals ri = yi — \i — x\f5. 
The residual ri is reset to r j + \i — fi when \i is updated and to r j + Xik (Pk~ $k) 
when Pk is updated. Organizing all updates around residuals promotes fast 
evaluation of g{6). 

3. Greedy coordinate descent for t\ regression. In greedy coordinate 
descent, we update the parameter 9k giving the most negative value of 
min{df ek (6),df- ek (9)}. If none of the coordinate directional derivatives is 
negative, then no further progress can be made. In lasso constrained l\ re- 
gression greedy coordinate descent is quick because directional derivatives 
are trivial to update. Indeed, if updating does not alter the sign of the 
residual r i = — \x — x\(3 for case i, then the contributions of case i to the 
various directional derivatives do not change. When the residual r% becomes 
or changes sign, these contributions are modified by simply adding or sub- 
tracting entries of the design matrix. Similar considerations apply when \i 
is updated. 
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To illustrate Edgeworth's algorithm in action, consider minimizing the 
two-parameter model g(9) = Ya=i \Vi ~ t 1 ~ x iP\ with a single slope (3. To 
update fi, we recall the well-known connection between i\ regression and 
medians and replace (i for fixed (3 by the sample median of the numbers Z{ = 
Hi — Xif3. This action drives g(6) downhill. Updating j3 for fi fixed depends 
on writing 



i=l 



sorting the numbers Z{ = (jji — fi)/xi, and finding the weighted median with 
weight Wi = \xi\ assigned to z%. We replace (3 by the order statistic zm whose 
index i satisfies 

i— 1 n in 

E w m < \ E w m > E w m ^ \ E w m ■ 

i=i j=i j=i i=i 

With more than a single predictor, we update parameter (3k by writing 



9(9) =E \ Xik \ 

i=l 



and finding the weighted median. 

Two criticisms have been leveled at Edgeworth's algorithm. First, al- 
though it drives the objective function steadily downhill, it sometimes con- 
verges to an inferior point. Li and Arce (2004) give an example involv- 
ing the data values (0.3,-1.0), (-0.4,-0.1), (-2.0,-2.9), (-0.9,-2.4) and 
(—1.1,2.2) for the pairs (xi,yi) and parameter starting values = (3.5,-1.0). 

Unfortunately, Li and Arce's suggested improvement to Edgeworth's algo- 
rithm does not generalize readily to multivariate linear regression. The sec- 
ond criticism is that convergence often occurs in a slow seesaw pattern. 
These defects are not fatal. 

In fact, our numerical examples show that the greedy version of Edge- 
worth's algorithm performs well on most practical problems. It has little 
difficulty in picking out relevant predictors, and it usually takes less com- 
puting time to converge than £2 regression by cyclic coordinate descent. In 
l\ regression, greedy coordinate descent is considerably faster than cyclic 
coordinate descent, probably because greedy coordinate descent attacks the 
significant predictors early on before it gets trapped by an inferior point. 

Implementing Edgeworth's algorithm with a lasso penalty requires view- 
ing the penalty terms as the absolute values of pseudo-residuals. Thus, we 
write \\(3j \ = \y — x t 9\ by taking y = and Xk = Xl{k=j}- Edgeworth's algo- 
rithm now applies. 

Because the l\ objective function is nondifferentiable, it is difficult to un- 
derstand the theoretical properties of t\ estimators. Our supplementary appendix 
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[Wu and Lange (2008)] demonstrates the weak consistency of penalized £\ 
estimators. The proof there builds on the previous work of Oberhofer (1983) 
on nonlinear £\ regression. Since we only consider linear models, it is possi- 
ble to relax and clarify his stated regularity conditions. Zhao and Yu (2006) 
summarize and extend previous consistency results for £2 penalized estima- 
tors. 

4. Convergence of the algorithms. The counterexample cited for Edge- 
worth's algorithm shows that it may not converge to a minimum point. The 
question of convergence for the £2 algorithms is more interesting. Textbook 
treatments of convergence for cyclic coordinate descent are predicated on the 
assumption that the objective function f(9) is continuously differentiable. 
For example, see Proposition 5.32 of Ruszczyhski (2006). Coordinate descent 
may fail for a nondifferentiable function because all directional derivatives 
must be nonnegative at a minimum point. It does not suffice for just the 
directional derivatives along the coordinate directions to be nonnegative. 
Unfortunately, the lasso penalty is nondifferentiable. The more general the- 
ory of Tseng (2001) does cover cyclic coordinate descent in £2 regression, 
but it does not apply to greedy coordinate descent. In the Appendix we 
demonstrate the following proposition. 

Proposition 1. Every cluster point of the £2 greedy coordinate descent 
algorithm is a minimum point of the objective function f{9). If the minimum 
point is unique, then the algorithm converges to it. If the algorithm converges, 
then its limit is a minimum point. 

Our qualitative theory does not specify the rate of convergence. Readers 
may want to compare our treatment of convergence to the treatment of Fu 
(1998). 

It would also be helpful to identify a simple sufficient condition making the 
minimum point unique. Ordinarily, uniqueness is proved by establishing the 
strict convexity of the objective function. If the problem is overdetermined or 
the penalty is a ridge penalty, then this is an easy task. For underdetermined 
problems with lasso penalties, strict convexity can fail. Of course, strict 
convexity is not necessary for a unique minimum; linear programming is full 
of examples to the contrary. Based on a conversation with Emanuel Candes, 
we conjecture that almost all (with respect to Lebesgue measure) design 
matrices lead to a unique minimum. 

5. £2 regression with group penalties. The issues of modeling and fast 
estimation are also intertwined with grouped effects, where we want coor- 
dinated penalties that tend to include or exclude all of the parameters in a 
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group. Suppose that the j3 parameters occur in q disjoint groups and 7^ de- 
notes the parameter vector for group j. The lasso penalty A||7j||i separates 
parameters and does not qualify as a sensible group penalty. For the same 
reason the scaled sum of squares A [| § is disqualified. However, the scaled 
Euclidean norm A||7j||2 is an ideal group penalty. It couples the parameters, 
it preserves convexity, and, as we show in a moment, it meshes well with 
cyclic coordinate descent in £2 regression. 

To understand its grouping tendency, note that the directional derivative 
of ||7j||2 along e^, the coordinate vector corresponding to 7^, is 1 when 
7j = and is when jj 7^ and 7jfc = 0. Thus, if any parameter 7^, / 7^ k, is 
nonzero, it becomes easier for 7^ to move away from 0. Recall that for a pa- 
rameter to move away from 0, the forward or backward directional derivative 
of the objective function must be negative. If a penalty contribution to these 
directional derivatives drops from 1 to 0, then the directional derivatives are 
more likely to be negative. 

In £2 regression with grouping effects, we recommend minimizing the ob- 
jective function 

3=1 3=1 

where g{6) is the residual sum of squares. If the tuning parameter A2 = 0, 
then the penalty reduces to the lasso. On the one hand when Ai = 0, only 
group penalties enter the picture. The mixed penalties with Ai > and 
A2 > enforce shrinkage in both ways. All mixed penalties are norms and 
therefore convex functions. Nonconvex penalties complicate optimization 
and should be avoided whenever possible. 

At each stage of cyclic coordinate descent, we are required to minimize 
g(6) + A 2 1 1 T j 1 1 2 + Ai||tj||i with respect to a component jjk of some jj. If 
7j =0, then H77H2 = |7jfc| as a function of jj^. Thus, minimization with 
respect to jjk reduces to the standard update for £2 regression with a lasso 
penalty. The lasso tuning parameter A equals Ai + A2 is this situation. When 
7j 7^ 0, the standard update does not apply. 

However, there is an alternative update that stays within the framework 
of penalized £2 regression. This alternative involves majorizing the objective 
function and is motivated by the MM algorithm for parameter estimation 
[Lange (2004)]. In view of the concavity of the square root function we 
have the inequality 

(2) Il7ill2<ll7rila + 2ii^(ll7illi-||7rill). 

H7j II2 

where the superscript m indicates iteration number. Equality prevails when- 
ever 7j = jj 1 . The right-hand side of inequality (2) is said to majorize the 
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left-hand side. This simple majorization leads to the additional majorization 
0(0) + A 2 ||7il|2 + Ai||7il|i 

mil i ^ /|l„, \\2 ii ,m ||2s 



<g(6) + X 2 



l^ll2 + 2rar:(Nll2-ll% 

II ij 1 1 ^ 



+ Ai 



As a function of jjk ignoring 7^, the second majorization amounts to a 
quadratic plus a lasso penalty. Fortunately, we know how to minimize such a 
surrogate function. According to the arguments justifying the descent prop- 
erty of the MM algorithm, minimizing the surrogate is guaranteed to drive 
the objective function downhill. 

To summarize, grouped effects can be handled by introducing penalties 
defined by the Euclidean norms of the grouped parameters. Updating a 
parameter follows the standard recipe when the other parameters of its 
group are fixed at 0. If one of the other parameters from its group is nonzero, 
then we majorize the objective function and minimize the surrogate function 
with respect to the designated parameter. Again, the update relies on the 
standard recipe. Although convergence may be slowed by majorization, it is 
consistent with cyclic coordinate descent and preserves the structure of the 
updates. 

6. Selection of the tuning constant A. As we mentioned earlier, selection 
of the tuning constant A can be guided by cross-validation. This is a one- 
dimensional problem, so inspection of the graph of the cross-validation error 
curve c(A) suffices in a practical sense. Recall that in fc-fold cross-validation, 
one divides the data into k equal batches (subsamples) and estimates pa- 
rameters k times, leaving one batch out per time. The testing error for each 
omitted batch is computed using the estimates derived from the remain- 
ing batches, and c(A) is computed by averaging testing error across the k 
batches. In principle, one can substitute other criterion for average cross- 
validation error. For instance, we could define c(A) by AIC or BIC criteria. 
For the sake of brevity, we will rest content with cross-validation error. 

Evaluating c(A) on a grid of points can be computationally inefficient, par- 
ticularly if grid points occur near A = 0. Although we recommend grid sam- 
pling on important problems, it is useful to pursue shortcuts. One shortcut 
combines bracketing and golden section search. Because coordinate descent 
is fastest when A is large and the vast majority of /3j are estimated as 0, it 
makes sense to start looking for a bracketing triple with a very large value 
Ao and work downward. One then repeatedly reduces A by a fixed propor- 
tion r E (0, 1) until the condition c(Afc + i) > c(Afc) first occurs. This quickly 
identifies a bracketing triple Xk-i > Xk > Afe+i with A& = r k Xo giving the 
smallest value of c(A). One can now apply golden section search to minimize 
c(A) on the interval (X^+i, Afc_i). With grouped parameters, finding the best 
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pair of tuning parameters (Ai, A2) is considerably more difficult. As a rough 
guess, we recommend consideration of the three cases: (a) Ai = 0, (b) A2 = 
and (c) Ai = A2. These one-dimensional slices yield to bracketing and golden 
section search. 

Selection of the tuning constant A has implications in setting the initial 
value of 9. For a single A, we recommend setting 9° = and all residuals 
ri = 0. As A decreases, we expect current predictors to be retained and 
possibly new ones to enter. If we estimate 9 for a given A, then it makes sense 
to start with 9 and the corresponding residuals for a nearby but smaller value 
of A. This tactic builds on already good estimates, reduces the number of 
iterations until convergence and saves considerable time overall in evaluating 
the c(A) curve. 

7. Analysis of simulated data. In evaluating the performance of the co- 
ordinate descent methods, we put special emphasis on the underdetermined 
setting p^> n highlighted by Wang et al. (2006a). In the regression model 



we assume that the random errors ej are independent and follow either a 
standard normal distribution or a Laplace (double exponential) distribution 
with scale 1. The predictor vectors X{ represent a random sample from a 
multivariate normal distribution whose marginals are standard normal and 
whose pairwise correlations are 



In every simulation the true parameter values are (3j = 1 for 1 < j ' < 5 and 
(3j = for j > 5. 

The quality of the parameter estimates and the optimal value of A are 
naturally of interest. To ameliorate the shrinkage of nonzero estimates for 
a particular A, we always re-estimate the active parameters, omitting the 
inactive parameters and the lasso penalty. This yields better parameter es- 
timates for testing purposes. The choice of A depends on the 10-fold average 
cross-validation error curve c(A). We sample c(A) on a grid and find its 
minimum by bracketing and golden section search as previously described. 
Given the optimal A, we re-estimate parameters from the data as a whole and 
compute prediction error on a testing data set of 20,000 additional cases. 
It is instructive to compare this approximate prediction error to the true 
prediction error using the true regression coefficients. 

Table 1 reports average prediction errors in t\ regression based on 50 
replicates and problem sizes of (p, n) = (5000,200), (p,n) = (5000,500) and 



p 





j < 10 and k < 10 
otherwise. 
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(p, n) = (50000,500) and both independent predictors (p = 0) and highly 
correlated predictors (p = 0.8). The average number of predictors selected 
is listed as iV nonzero , and the average number of true predictors selected is 
listed as N tTUC . Average computing time in seconds at the optimal value of 
A is recorded in the last column of the table. On our personal computers, 
computing times are remarkably fast even for data sets as large as (p, n) = 
(50000,500). It is clear that the coordinate descent algorithms keep almost 
all true predictors while discarding the vast majority of irrelevant ones. 
Approximate prediction errors are very close to true prediction errors. 

To better understand the impact of the tuning constant A in penalized l\ 
regression, we plot prediction error versus A in the left panel of Figure 1 for 
one realization of the data. Here we take (p,n) = (5000,200), independent 
predictors, and Laplace errors. The solid line shows prediction errors based 

Table 1 

Simulation results for li regression with a lasso penalty. Standard errors of estimates 
appear in parentheses. The left error column is testing error under the true parameter 
values; the right error column is testing error under the estimated parameter values 



/3 = (1, 1,1,1, 1,0,. ..,0) 



Distribution 


(P,n) 


P 


Error 


A 


Error 


^* nonzero 


true 


Time 


Laplace 


(5000,200) 


0.00 


0.99 


44.05 


1.11 


14.06 


5.00 


0.02 










(4.43) 


(0.08) 


(8.63) 


(0.00) 


(0.01) 


Laplace 


(5000, 200) 


0.80 


0.99 


72.39 


1.04 


6.80 


5.00 


0.04 










(13.60) 


(0.03) 


(1.57) 


(0.00) 


(0.01) 


Laplace 


(5000, 500) 


0.00 


1.01 


101.51 


1.01 


5.18 


5.00 


0.09 










(10.58) 


(0.01) 


(0.65) 


(0.00) 


(0.01) 


Laplace 


(5000, 500) 


0.80 


1.01 


132.88 


1.01 


6.22 


5.00 


0.09 










(34.93) 


(0.01) 


(1.27) 


(0.00) 


(0.02) 


Laplace 


(50000, 500) 


0.00 


1.01 


109.21 


1.01 


5.12 


5.00 


0.36 










(9.45) 


(0.01) 


(0.32) 


(0.00) 


(0.04) 


Laplace 


(50000, 500) 


0.80 


1.01 


150.44 


1.01 


6.44 


5.00 


1.59 










(43.68) 


(0.01) 


(1.12) 


(0.00) 


(0.33) 


Normal 


(5000, 200) 


0.00 


0.80 


48.46 


0.84 


7.84 


5.00 


0.03 










(3.58) 


(0.04) 


(3.31) 


(0.00) 


(0.02) 


Normal 


(5000, 200) 


0.80 


0.80 


76.71 


0.82 


6.08 


4.98 


0.04 










(14.44) 


(0.02) 


(0.93) 


(0.14) 


(0.01) 


Normal 


(5000, 500) 


0.00 


0.80 


101.71 


0.81 


5.48 


5.00 


0.05 










(16.98) 


(0.01) 


(1.19) 


(0.00) 


(0.01) 


Normal 


(5000, 500) 


0.80 


0.80 


150.83 


0.81 


6.04 


5.00 


0.10 










(47.28) 


(0.01) 


(1.06) 


(0.00) 


(0.03) 


Normal 


(50000, 500) 


0.00 


0.80 


112.03 


0.81 


5.10 


5.00 


0.75 










(12.03) 


(0.01) 


(0.46) 


(0.00) 


(0.07) 


Normal 


(50000, 500) 


0.80 


0.80 


149.83 


0.81 


6.36 


5.00 


2.02 










(42.17) 


(0.01) 


(1.09) 


(0.00) 


(0.45) 
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on 10-fold cross-validation. The dashed line shows prediction errors based 
on the 20,000 testing cases. It is noteworthy that cross-validation underes- 
timates the optimal value of A suggested by testing error. The optimal A 
based on 10-fold cross-validation is around 45, and the optimal A based on 
20,000 testing cases is around 50. Many statisticians are comfortable with 
this conservative bias of cross-validation. 

Table 2 reports the results of £2 regression under the same conditions 
except for normal errors. The cyclic coordinate descent algorithm for £2 re- 
gression is slightly more reliable, slightly less parsimonious and considerably 
slower than the greedy coordinate descent algorithm for £\ regression. The 
right panel of Figure 1 plots cross-validation error and approximate predic- 
tion error versus A. 

Table 3 compares the speed and performance of three algorithms for lasso 
penalized £2 regression on one realization of each of the simulated data sets 
with normally distributed errors. Because LARS [Hastie and Efron (2007)] 
is considered by many to be the best competing algorithm, it is reasonable 
to limit our comparison of the two versions of coordinate descent to LARS. 
Three conclusions can be drawn from Table 3. First, cyclic coordinate de- 
scent is definitely faster than greedy coordinate descent for £2 regression. 
Second, both methods are considerably faster and more robust than LARS. 
Third, both methods are more successful than LARS in model selection. 
Note that the error estimates in the table for the coordinate descent algo- 
rithms reflect the re-estimation step mentioned earlier. This may put LARS 
at a disadvantage. 

Table 4 compares the greedy coordinate descent and cyclic coordinate 
descent algorithms for £\ regression. The settings are the same as in Table 




D 50 100 150 200 50 100 150 200 250 300 

lambda lamtefa 

Fig. 1. Left panel: Plot of prediction error versus X in l\ regression of simulated data. 
Right panel: Plot of prediction error versus A in £2 regression of simulated data. In both 
figures, the solid line represents the errors based on 10-fold cross-validation, and the dashed 
line represents the errors based on 20,000 testing cases. 
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Table 2 

Simulation results of £2 regression with a lasso penalty. Standard errors of estimates 
appear in parentheses. The left error column is testing error under the true parameter 
values; the right error column is testing error under the estimated parameter values 



f3 = (1,1, 1,1, 1,0,..., 0) 



Distribution 




p 


Error 


A 


Error 


1 ~ nonzero 


true 


Time 


Laplace 


(5000,200) 


0.00 


1.97 


112.17 


2.13 


5.70 


5.00 


0.05 










(16.37) 


(0.15) 


(2.26) 


(0.00) 


(0.00) 


Laplace 


(5000,200) 


0.80 


1.97 


197.84 


2.13 


5.98 


4.86 


0.33 










(90.64) 


(0.15) 


(1.39) 


(0.45) 


(0.13) 


Laplace 


(5000,500) 


0.00 


2.03 


254.62 


2.03 


5.20 


5.00 


0.11 










(97.20) 


(0.06) 


(1.13) 


(0.00) 


(0.01) 


Laplace 


(5000,500) 


0.80 


2.03 


611.90 


2.03 


5.30 


5.00 


0.81 










(180.68) 


(0.04) 


(0.54) 


(0.00) 


(0.14) 


Laplace 


(50000, 500) 


0.00 


2.02 


255.96 


2.03 


5.04 


5.00 


0.91 










(71.50) 


(0.05) 


(0.28) 


(0.00) 


(0.22) 


Laplace 


(50000, 500) 


0.80 


2.02 


588.38 


2.04 


5.64 


5.00 


12.30 










(231.48) 


(0.04) 


(0.87) 


(0.00) 


(5.05) 


Normal 


(5000,200) 


0.00 


1.01 


107.14 


1.03 


5.04 


5.00 


0.05 










(22.11) 


(0.03) 


(0.20) 


(0.00) 


(0.01) 


Normal 


(5000,200) 


0.80 


1.01 


216.30 


1.04 


5.72 


4.98 


0.52 










(90.02) 


(0.04) 


(0.94) 


(0.14) 


(0.13) 


Normal 


(5000,500) 


0.00 


1.01 


240.85 


1.01 


5.02 


5.00 


0.13 










(96.29) 


(0.01) 


(0.14) 


(0.00) 


(0.01) 


Normal 


(5000,500) 


0.80 


1.01 


555.79 


1.01 


5.32 


5.00 


0.71 










(214.27) 


(0.01) 


(0.55) 


(0.00) 


(0.09) 


Normal 


(50000, 500) 


0.00 


1.01 


244.31 


1.01 


5.00 


5.00 


0.98 










(102.34) 


(0.01) 


(0.00) 


(0.00) 


(0.07) 


Normal 


(50000, 500) 


0.80 


1.01 


549.40 


1.01 


5.50 


5.00 


6.40 










(195.81) 


(0.01) 


(0.70) 


(0.00) 


(1.05) 



3 except that the residual errors follow a Laplace distribution rather than 
a normal distribution. The last column reports the ratio of the objective 
functions under the two algorithms at their converged values. Inspection of 
the table shows that the greedy algorithm is faster than the cyclic algorithm. 
Both algorithms have similar accuracy, and their accuracies are roughly 
comparable to the accuracies seen in Table 3 under the heading \\(3 — . 
These positive results relieve our anxieties about premature convergence 
with coordinate descent. 

Some of the competing algorithms for l\ regression simply do not work on 
the problem sizes encountered in the current comparisons. For instance, the 
standard iteratively reweighted least squares method proposed by Schlossmacher 
(1973) and Merle and Spath (1974) falters because of the large matrix inver- 
sions required. It is also hampered by infinite weights for those observations 
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Table 3 






Speed and accuracy 


of different algorithms for 


lasso penalized ^2 regression 


Algorithm 


(Pi n) 


p 


AT nonZ ero 


iVtme Time 


ll/3-/3|li 

M r~" "111 


Cyclic 


(5000, 200) 





5 


5 0.04 


0.51592 


Greedy 






5 


5 0.11 


0.51567 


LARS 






94 


5 2.19 


3.33400 


Cyclic 


(5000, 200) 


0.8 


5 


5 0.18 


1.01544 


Greedy 






5 


5 0.36 


1.01892 


LARS 






35 


5 5.45 


1.48300 


Cyclic 


(50000,500) 





5 


5 0.99 


0.68995 


Greedy 






7 


5 2.90 


0.68700 


LARS 








not available 




Cyclic 


(50000,500) 


0.8 


5 


5 4.11 


0.60956 


Greedy 






5 


5 7.94 


0.60875 


LARS 








not available 




Cyclic 


(500, 5000) 





5 


5 0.24 


0.06338 


Greedy 






5 


5 0.36 


0.06336 


LARS 






27 


5 0.78 


0.25370 


Cyclic 


(500, 5000) 


0.8 


5 


5 0.30 


0.11082 


Greedy 






5 


5 0.71 


0.11049 


LARS 
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5 1.168 


0.18884 



with zero residuals. We were unsuccessful in getting the standard software of 
Barrodale and Roberts (1980) to run properly on these large-scale problems. 

8. Analysis of real data. We now turn to real data involving gene expres- 
sion levels and obesity in mice. Wang et al. (2006b) measured abdominal fat 



Table 4 

Comparison of greedy and cyclic coordinate descent for lasso penalized i\ regression 



Algorithm 


{P,n) 


P 


N 

1 ~ nonzero 


true 


Time 


113-011! 


f greedy 
f cyclic 


Greedy 


(5000, 200) 





7 


5 


0.02 


0.84228 


1.01063 


Cyclic 






7 


5 


0.10 


0.91861 


(A = 50) 


Greedy 


(5000,200) 


0.8 


5 


5 


0.04 


0.53354 


0.99118 


Cyclic 






6 


5 


0.39 


0.74330 


(A = 57.58) 


Greedy 


(50000,500) 





5 


5 


0.34 


0.32212 


1.00288 


Cyclic 






5 


5 


3.99 


0.28565 


(A = 124.5) 


Greedy 


(50000, 500) 


0.8 


7 


5 


1.66 


0.97379 


1.00018 


Cyclic 






8 


5 


8.66 


0.84372 


(A= 110) 


Greedy 


(500,5000) 





5 


5 


0.07 


0.06680 


0.99938 


Cyclic 






5 


5 


1.01 


0.06604 


(A = 1144.14) 


Greedy 


(500,5000) 


0.8 


5 


5 


0.13 


0.12882 


1.00008 


Cyclic 






5 


5 


1.53 


0.12943 


(A = 1144.14) 
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mass on n = 311 F2 mice (155 males and 156 females). The F2 mice were 
created by mating two inbred strains and then mating brother-sister pairs 
from the resulting offspring. Wang et al. also recorded the expression levels 
in liver of p = 23,388 genes in each mouse. 

Our first model postulates that the fat mass yi of mouse i satisfies 

p 

Vi = l{i malcjW + l{i female}/^ + 2_j Xi i&i + £ *' 

where is the expression level of gene j of mouse i and is random error. 
Since male and female mice exhibit across the board differences in size and 
physiology, it is prudent to estimate a different intercept for each sex. The 
left panel of Figure 2 plots as a function of A the average number of nonzero 
predictors and the average prediction error. Here we use £\ regression and 
10- fold cross-validation. The right panel of Figure 2 plots the same quantities 
under £2 regression. 

The 10-fold cross-validation curve c(A) is ragged under both £\ and £2 re- 
gression. For £\ regression, examination of c(A) over a fairly dense grid shows 
an optimal A of about 3.5. Here the average number of nonzero predictors 
is 88.5, and the average testing error is 0.6533. For the entire data set, the 
number of nonzero predictors is 77, and the training error is 0.4248. For £2 
regression, the optimal A is 7.8. Here the average numbers of predictors is 
36.8, and the average testing error is 0.7704. For the entire data set, the 



567.2 166 7t 31 16 ft 7 4 2.2 0.3 3067 7 2.3 1.9 1 1 1 1 1 1 1 1 0.7 0.1 



2 




0.1 19.1 40.2 61.3 92.4 105.6 130.9 156.2 181.5 0.1 19.1 40.2 61.3 S2.4 105.6 130.9 156,2 181.5 



Fig. 2. Left panel: Plot of 10-fold cross-validation error and number of predictors versus 
X inii regression of the mice microarray data. Right panel: Plot of 10-fold cross-validation 
error and number of predictors versus X in £2 regression of the mice microarray data. The 
lower x-axis plots the values of X, and the upper x-axis plots the number of predictors. The 
y-axis is prediction error based on 10-fold cross-validation versus X. 
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number of nonzero predictors is 41, and the training error is 0.3714. The 
preferred i\ and £2 models share 27 predictors in common. 

Given the inherent differences between the sexes, it is enlightening to 
assign sex-specific effects to each gene and to group parameters accordingly. 
These decisions translate into the model 



l{i male} + E Eij (hi J + l{i female} + E X ij02j ) + e * 



V; 



3=1 



v 

j=0 



where the bivariate vectors z%j and jj group predictors and parameters, 
respectively. Thus, 




i is male and j = 0, 
i is female and j = 0, 
i is male and j > 0, 
i is female and j > 0, 



and 7q = (/zi, ^2) and 7^ = (Pij,fhj) for j > 0. In this notation, the objective 
function becomes 

n / p \ 2 g g 

/ (o) = \ E w - E 4^ + A 2 E n^-ib + A! E hi Ik- 

i=i V j=o / j=i j=i 

Under 10-fold cross-validation, the optimal pair (Ai, A2) occurs at approx- 
imately (5, 1). This choice leads to an average of 40.1 nonzero predictors and 
an average prediction error of 0.8167. For the entire data set, the number 
of nonzero predictors is 44, and the training error is 0.3128. Among the 44 
predictors, three are paired female-male slopes. Thus, the preferred model 
retains 41 genes in all. Among these 41 genes, 25 appear in the t\ model, 
and 26 appear in the £2 model without the group penalties. For all three 
models, there are 20 genes in common. 

In carrying out these calculations, we departed from the tack taken by 
Wang et al. (2006b), who used marker genotypes rather than expression lev- 
els as predictors. Our serendipitous choice identified some genes known to be 
involved in fat metabolism and turned up some interesting candidate genes. 
One of the known genes from the short list of 20 just mentioned is pyru- 
vate dehydrogenase kinase isozyme 4 (Pdk4). This mitochondrial enzyme, 
which has been studied for its role in insulin resistance and diabetes, is a 
negative regulator of the pyruvate dehydrogenase complex. The upregula- 
tion of Pdk4 promotes gluconeogenesis. In the coexpression network studies 
of Ghazalpour et al. (2005, 2006), Pdk4 was one of the genes found in the 
module associated with mouse body weight. Here we find that expression 
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of Pdk4 is a good predictor of fat pad mass. It has been suggested that 
enhanced PDK4 expression is a compensatory mechanism countering the 
excessive formation of intracellular lipid and the exacerbation of impaired 
insulin sensitivity [Huang et al. (2002) and Sugden (2003)]. 

A second gene on our list is leukotriene a4 hydrolase. This enzyme con- 
verts leukotriene A4 to leukotriene B4 as part of the 5-lipooxygenase in- 
flammatory pathway, which Mehrabian et al. (2005) report influences adi- 
posity in rodents. Other genes on our list include three involved in energy 
metabolism: Ckm, which plays a central role in energy transduction; 3- 
hydroxyisobutyrate dehydrogenase, which is part of the valine, leucine and 
isoleucine degradation pathway previously reported to be associated with 
subcutaneous fat pad mass in a different mouse cross [Ghazalpour et al. 
(2005)]; and the thiamine metabolism pathway gene Nsfl, which uses an 
endproduct of the steroid metabolism pathway. In addition, we identified 
several genes with no obvious ties to fat pad mass, energy metabolism, or 
obesity in general, including three riken cDNAs, Plekha8, Gloxdl, the sig- 
naling molecule Rras, and the transcription factor Foxj3. All of these are 
also present in our larger list of 27 genes ignoring sex dependent slopes. 
This larger list includes another transcription factor, an olfactory receptor 
gene, and an adhesion molecule. The olfactory receptor gene is particularly 
intriguing because it could affect feeding behavior in mice. 

9. Discussion. Lasso penalized regression performs continuous model se- 
lection by estimation rather than by hypothesis testing. Several factors con- 
verge to make penalized regression an ideal exploratory data analysis tool. 
One is the avoidance of the knotty issues of multiple testing. Another is 
the sheer speed of the coordinate descent algorithms. These algorithms of- 
fer decisive advantages in dealing with modern data sets where predictors 
wildly outnumber cases. If Fu (1998) had written his paper a few years 
later, this trend would have been clearer, and doubtless he would not have 
concentrated on small problems with cases outnumbering predictors. 

We would not have predicted beforehand that the normally plodding co- 
ordinate descent methods would be so fast. In retrospect, it is clear that 
their avoidance of matrix operations and quick dismissal of poor predictors 
make all the difference. Our initial concern about the adequacy of Edge- 
worth's algorithm have been largely laid to rest by the empirical evidence. 
On data sets with an adequate number of cases, the poor behavior reported 
in the past does not predominate for either version of coordinate descent. 

Although the supplementary appendix [Wu and Lange (2008)] proves that 
lasso constrained l\ regression is consistent, parameter estimates are biased 
toward zero in small samples. For this reason, once we have identified the 
active parameters for a given value of the tuning constant A, we re-estimate 
them ignoring the lasso penalty. Failure to make this adjustment tends to 
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favor smaller values of A in cross-validation and the inclusion of irrelevant 
predictors in the preferred model. 

Better understanding of the convergence properties of the algorithms is 
sorely needed. Tseng (2001) proves convergence of cyclic coordinate descent 
for I2 penalized regression. Our treatment of greedy coordinate descent in 
the Appendix is different and simpler. Neither proof determines the rate of 
convergence. Even more pressing is the challenge of overcoming the theoret- 
ical defects of Edgeworth's algorithm without compromising its speed. On a 
practical level, the reparameterization of Li and Arce (2004), which operates 
on pairs of parameters, may allow Edgeworth's algorithm to escape many 
trap points. If this tactic is limited to the active parameters, then speed may 
not degrade unacceptably. 

Our algorithm for grouped parameters exploits a majorization used in 
constructing other MM algorithms. The techniques and theory behind MM 
algorithms deserve to be better known [Hunter and Lange (2004)]. Majoriza- 
tion approximately doubles the number of iterations until convergence in the 
£2 cyclic coordinate descent algorithm. Although both the original and the 
grouped £2 algorithms take hundreds of iterations to converge, each iteration 
is so cheap that overall speed is still impressive. 

Lasso penalized estimation extends far beyond regression. The papers of 
Fu (1998) and Park and Hastie (2006a, 2006b) discuss some of the possibil- 
ities in generalized linear models. We have begun experimenting with cyclic 
coordinate descent in logistic regression. Although explicit maxima for the 
one-dimensional subproblems are not available, Newton's method converges 
reliably in a handful of steps. The results are very promising and will be 
dealt with in another paper. 

APPENDIX: CONVERGENCE THEORY 

Our proof of Proposition 1 splits into a sequence of steps. We first show 
that a minimum exists. This is a consequence of the continuity of f(ff) and 
the coerciveness property that Wmp^^^ f(8) = 00. If any \f3j\ tends to 00, 
then the claimed limit is obvious. If all (3j remain bounded but \n\ tends to 
00, then each squared residual — fx — x\j3) 2 tends to 00. 

Selection of which component of f(9) to update is governed by 

h(9) = minmm{d e J (9), d- e J (9)}. 

Although the function h{9) is not continuous, it is upper semicontinuous. 
This weaker property will be enough for our purposes. Upper semicontinuity 
means that limsup m ^ 00 h(9 rn ) < h(8) whenever 9 m converges to 9 [Rudin 
(1987)]. The collection of upper semicontinuous functions includes all contin- 
uous functions and is closed under the formation of finite sums and minima. 
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Every directional derivative of the residual sum of squares is continuous. 
Thus, to verify that h{9) is upper semicontinuous, it suffices to show that 
the coordinate directional derivatives of the penalty terms are upper 

semicontinuous. In a small enough neighborhood of (3j 7^ 0, all coordinate 
directional derivatives of \\(3j\ are constant and therefore continuous. At 
(3j = the directional derivatives along ej and — ej are both A, the maximum 
value possible. Hence, the limiting inequality of upper semicontinuity holds. 

Any discussion of convergence must take into account the stationary 
points of the algorithm. Such a point 9 satisfies the conditions d ej f{9) > 
and d- ej f(9) > for all j. If we let \x vary along the coordinate direction eo, 
then straightforward calculations produce the general directional derivative 

f) { v j> °j > °> 

= Etm:^H + a Ei ~> 9 J <0 

It follows that 



m 3 j>0 l\ Vi \, 0j = 0. 



d v f{9) = J2 d ej f(e) Vj + J2 d-ejmvjl 

Vj>0 Vj<0 

and that every directional derivative is nonnegative at a stationary point. 

Because f(0) is convex, the difference quotient s~ 1 [f(9 + sv) — f(6)] is 
increasing in s > 0. Therefore, f(6 + v) — f(9) > d v f(9), and if 6 is a sta- 
tionary point, then f(8 + v) > f(8) for all v. In other words, 9 is a minimum 
point. Conversely, it is trivial to check that d v f{6) > for every v when 6 is 
a minimum point. Hence, stationary points and minimum points coincide. 

With these preliminaries in mind, suppose the sequence 9 m generated by 
greedy coordinate descent has a subsequence 9 mk converging to a nonsta- 
tionary point 9*. By virtue of semicontinuity, we have 

(A.l) h{9 mk )<\h{9*)<Q 

for infinitely many k. We will demonstrate that this inequality forces the 
decreasing sequence f(9 m ) to converge to —00, contradicting the fact that 
f(9) is bounded below. In fact, we will demonstrate that there exists a 
constant c> with f(9 mk+1 ) < f(9 mk ) — c for all 9 mk satisfying inequality 
(A.l). The existence of the constant c is tied to the second derivatives 

l^/w =E4 

j 8=1 

of f(8) along each coordinate direction. Let b = maxj <Ya=i x %- 

Now suppose that 9 mk satisfies inequality (A.l). For notational simplicity, 
let y = 9j be the component being updated, x = 9™ k , and s(y) = f(9) as a 
function of 9j. Provided we restrict y to the side of showing the most 
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negative directional derivative, s(y) is twice differentiable and satisfies the 
majorization 

s(y) = s(x) + s'{x)(y - x) + \s"{z){y - xf 
< s(x) + s'(x)(y — x) + \b{y — x) 2 . 
If w = x — denotes the minimum of the majorizing quadratic, then 

s(w) < s(x) + s'(x)(w — x) + -b(w — x) 2 = s(x) — - — — — . 
At the minimum z of s(y) we have 

, ^ ls'(x) 2 
8(Z)<8(W)<8(X)---^-. 

This identifies the constant c as c = ^[^h(9*)] 2 and completes the proof of 
the proposition. 
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SUPPLEMENTARY MATERIAL 

Proof of weak consistency of Lasso penalized t\ regression 

(doi: 10.1214/07-AOAS147SUPP; .pdf). Our supplementary appendix demon- 
strates the weak consistency of penalized l\ estimators. The proof is a 
straightforward adaptation of the arguments of Oberhofer (1983) on non- 
linear l\ regression. Since we only consider linear models, the regularity 
conditions in Oberhofer (1983) are relaxed and clarified. 
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